Can Teachers be Evaluated by their Students’ Test Scores? Should They Be?



Introduction 

“Value-added” measures of teacher effectiveness 
are the centerpiece of a national movement to 
evaluate, promote, compensate, and dismiss 
teachers based in part on their students’ test 
results. Federal, state, and local policy-makers 
have embraced these measures in recent years as 
a means to objectively quantify teacher quality 
and to identify, reward, and retain teachers with 
a demonstrated record of success. For example, 
in New York City, the Department of Educa- 
tion now releases “Teacher Data Reports” to its 
teachers in grades four to eight that concisely 
summarize teachers’ value-added information. 

In Washington, D.C., and Houston, teachers
can be granted or denied tenure partially based
on value-added, and Houston awards bonuses
to its high value-added teachers.

In theory, a teacher’s value-added is the unique 
contribution she makes to her students’ aca- 
demic progress - that is, the portion of her stu- 
dents’ achievement that cannot be attributed to 
any other current or past student, family, 
teacher, school, peer, or community influence. 
Because students are rarely assigned randomly 
to teachers, value-added measures must rely on 
complex statistical models to infer how much 
better or worse a student performed under one 
teacher than they would have performed under 
another. The ultimate goal of these tools, then, 
is to differentiate the causal impact of individ- 
ual teachers on student outcomes. 

Few can deny the intuitive appeal of value- 
added assessment: if a statistical model can iso- 
late a teacher’s unique effect on achievement, 
the possibilities seem endless. Teacher quality 
is an immensely important resource, and
research has found that teachers can and do
vary widely in their effectiveness (e.g., Kane, 
Rockoff & Staiger 2008). Common measures 
of teacher qualifications, such as experience 
and college selectivity, typically provide mini- 
mal information about individual teachers’ 
effectiveness. Value-added holds out the prom- 
ise that the elusive concept of “teacher quality” 
can be objectively and precisely measured. 

However, these tools have limitations and
shortcomings that are not always apparent to 
interested stakeholders - including teachers, 
principals, and policy-makers - or even to 
value-added advocates. In the report this exec- 
utive summary is based on, I provide an intro- 
duction to these new measures of teaching 
effectiveness; describe prominent value-added 
systems currently in use in New York City and 
Houston; assess the potential for value-added
measurement to improve student outcomes, 
using these programs as empirical case studies; 
and outline some important challenges facing 
their implementation in practice. This execu-
tive summary reviews these concepts and
findings; for more detailed background on the
New York City and Houston programs, data
analysis, and discussion, see the full report at
<www.annenberginstitute.org/Products/Corcoran.php>.



What is a Teacher's Value-Added? 

A teacher’s value-added can be thought of as her 
students’ average test scores “properly adjusted” 
for the effects of other influences on achieve- 
ment. For example, in New York City’s Teacher 
Data Reports, students’ actual scores under a 
given teacher are compared to their predicted 
score - that is, their predicted achievement had 
they been taught by another teacher in the dis- 
trict (say, the average teacher). This prediction 
is based on a number of factors, the most 
important of which is the student’s prior 
achievement. Because the school district has 
richly detailed data on thousands of students’ 
academic histories, it can provide a statistical 
estimate of how each student is likely to have 
performed on a test given their background 
characteristics. How a student actually performs 
under a teacher relative to this prediction is the 
teacher’s value-added for that student. 
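
The comparison of actual and predicted scores can be illustrated with a
minimal Python sketch. It assumes a deliberately simplified prediction
model, a straight line fitted to prior-year scores only, whereas district
models draw on many more factors. The function names and the toy records
are hypothetical and are not taken from any district’s system.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def value_added(records):
    """records: (teacher, prior_score, current_score), one tuple per student.
    Returns each teacher's average (actual - predicted) score."""
    intercept, slope = fit_line([r[1] for r in records],
                                [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior  # expected score given prior score
        residuals[teacher].append(current - predicted)
    return {t: sum(v) / len(v) for t, v in residuals.items()}

# Hypothetical toy data: two teachers, three students each.
records = [
    ("A", 40, 52), ("A", 55, 66), ("A", 70, 79),
    ("B", 42, 44), ("B", 58, 59), ("B", 73, 72),
]
scores = value_added(records)
```

In this toy example, teacher A’s students score above their predictions
and teacher B’s below, so A receives a positive value-added and B a
negative one. The result is relative by construction: the predictions
come from the pooled data, so one teacher’s gain is another’s loss.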

Though not always obvious to most observers, 
value-added in practice is a relative concept. It 
tells us how teachers measure up when com- 
pared with other teachers in the district or state 
with similar students. On the New York City 
Teacher Data Report, this is reported as the 
teacher’s percentile in the distribution of teach- 
ers with similar experience in the same grade 
and subject. For example, based on last year’s 
test results, an eighth-grade math teacher’s 
value-added might place him at the 43rd per- 
centile citywide. In other words, 43 percent of 
teachers had lower value-added than he did 
(and 57 percent had higher value-added). This 
percentile is then mapped to one of five per- 
formance categories (“high,” “above average,” 
“average,” “below average,” and “low”). New 
York City has recently encouraged principals to 
use these metrics in making teacher tenure deci-
sions. In Houston’s ASPIRE (Accelerating Stu-
dent Progress, Increasing Results & Expecta-
tions) program, value-added measures are used
in tenure decisions as well as in a system of 
bonus payments; teachers scoring in the top 
performance categories are awarded bonuses as 
high as $10,300 per year. 

Because value-added is statistically estimated, it 
is subject to uncertainty, or a “margin of error.” 
On the Teacher Data Report, this is reported as 
a range of possible percentiles associated with 
the value-added score (also called the per- 
centile’s “confidence interval”). For example, a 
teacher at the 43rd percentile might have a 
range that extends from the 15th percentile to
the 71st. This means that the statistical model 
cannot rule out the possibility that this teacher 
falls somewhere between the 15th and 71st 
percentiles, although the 43rd is her “most 
likely” ranking. New York’s reports include 
value-added measures, percentiles, and per- 
formance categories for all tested students, as 
well as for subgroups of students, such as ini-
tially low-achieving students or English lan-
guage learners (ELLs).
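
One way to see how a margin of error on a value-added score becomes a
range of percentiles is the following Python sketch. It assumes a
normal-approximation interval (estimate plus or minus 1.96 standard
errors) whose endpoints are then ranked against other teachers’ point
estimates. This is an illustrative assumption, not a description of how
the Teacher Data Reports compute their ranges, and all names and numbers
are hypothetical.

```python
from bisect import bisect_left

def percentile_rank(value, all_values):
    """Percent of values in all_values that lie strictly below value."""
    ordered = sorted(all_values)
    return 100.0 * bisect_left(ordered, value) / len(ordered)

def percentile_range(estimate, std_error, all_estimates, z=1.96):
    """Return (low, point, high) percentiles for one teacher."""
    low = percentile_rank(estimate - z * std_error, all_estimates)
    point = percentile_rank(estimate, all_estimates)
    high = percentile_rank(estimate + z * std_error, all_estimates)
    return low, point, high

# Hypothetical district: 100 teachers with evenly spaced estimates.
all_estimates = list(range(-50, 50))
low, point, high = percentile_range(-7, 14, all_estimates)
```

The larger the standard error relative to the spread of estimates across
teachers, the wider the resulting band of percentiles.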
Challenges to the Practical
Implementation of Value-Added 

I categorize the conceptual and practical chal- 
lenges to value-added methods of evaluating 
teachers into six key questions: 

• What is being measured? 

• Is the measurement tool appropriate? 

• Can a teacher’s unique effect be isolated? 

• Who counts? 

• Are value-added scores precise enough to be 
useful? 

• Is value-added stable from year to year? 

What is being measured? 

Value-added measurement works best when 
students receive a single numeric test score 
every year on a continuous developmental scale 
- that is, one that does not depend on grade- 
specific content, but rather progresses across 
multiple grade levels. The set of skills and sub- 
jects that can be adequately assessed in this way 
is remarkably small. Not all subjects are or can 
be tested, and even within tested subject areas, 
only certain skills readily conform to standard- 
ized testing. Yet, valid value-added measures 
depend entirely on such tests. Houston’s 
ASPIRE program currently incorporates results 
from two sets of core subject tests into its 
value-added system (reading, math, science, 
social studies, and language arts). New York 
City strictly relies on the state’s math and Eng- 
lish language arts exams. Neither the Texas nor 
the New York state test was designed on a con- 
tinuous developmental (or “vertically equated”) 
scale. 



Is the measurement tool appropriate? 

In assessing a broad set of skills, an instrument 
must be devised that provides a valid and reli- 
able inference about students’ mastery of those 
skills. No test will cover all of the standards 
that students are expected to master. By neces- 
sity, a test instrument must sample items from 
a much broader domain of skills. Only by 
drawing an even and representative sample 
from this broader domain can a test provide a 
valid inference about student learning in that 
domain. 

However, such tests are the exception, not the 
rule. Many skills simply are not amenable to 
standardized tests and, inevitably, are underrep- 
resented on the test. Many skills that can be 
tested never appear on the test. Others are 
over-represented on the test. Teachers aware of 
systematic omissions and repetitions can sub- 
stantially inflate scores by narrowly focusing on 
these items or by “teaching to the format” of 
the test. Recent studies of the New York, Texas, 
and Massachusetts tests find that some parts of 
the state curriculum never appear on the test 
(Jennings & Bearak 2010; Holcombe, Jennings 
& Koretz 2010). For example, 50 percent of 
the possible points on the 2009 New York 
eighth-grade math test were based on only 
seven of the forty-eight state standards; only a 
score of 51 percent was required to pass.

A useful way to look at the importance of the 
test itself is to compare value-added calcula- 
tions from more than one test. Since 1998, 
Houston has administered two standardized 
tests annually: the Texas Assessment of Knowl- 
edge and Skills (TAKS) and the nationally 
normed Stanford Achievement Test. Using 
Houston data, I calculated separate value- 
added measures for fourth- and fifth-grade 
teachers on the two tests in the same subject,
using the same students, tested at approximately
the same time of year. Teachers who had high 
value-added on one test tended to score well on 
the other, but there were many inconsistencies. 
Many teachers who scored in the top category 
of the TAKS reading test ranked among the 
lowest categories on the Stanford test, and vice 
versa. 

In a related study, Papay (2010) calculated 
ASPIRE bonuses using value-added estimates 
from separate tests and found that “simply 
switching the outcome measure would affect 
the performance bonus for nearly half of all 
teachers and the average teacher’s salary would 
change by more than $2,000” (p. 3). Such wild 
inconsistencies certainly run counter to the 
intended goals of value-added assessment. 
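
A comparison of this kind can be sketched in Python as a cross-tabulation
of teachers’ rankings on two tests. The sketch assumes teachers are
grouped into fifths of the ranking on each test, which only approximates
the performance categories used in practice; the function names and the
ten-teacher data set are hypothetical.

```python
from collections import Counter

def fifths(scores):
    """Map each teacher to a fifth of the ranking: 0 = lowest, 4 = highest."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {teacher: 5 * i // n for i, teacher in enumerate(ordered)}

def cross_tabulate(test_a, test_b):
    """Count teachers by (fifth on test A, fifth on test B)."""
    a, b = fifths(test_a), fifths(test_b)
    return Counter((a[t], b[t]) for t in a)

def share_in_same_fifth(test_a, test_b):
    """Fraction of teachers placed in the same fifth by both tests."""
    table = cross_tabulate(test_a, test_b)
    same = sum(count for (fa, fb), count in table.items() if fa == fb)
    return same / sum(table.values())

# Hypothetical value-added estimates for ten teachers on two tests.
test_a = {f"T{i}": float(i) for i in range(10)}
test_b = dict(test_a)
test_b["T0"], test_b["T9"] = test_b["T9"], test_b["T0"]  # two teachers trade places

table = cross_tabulate(test_a, test_b)
agreement = share_in_same_fifth(test_a, test_b)
```

Off-diagonal cells of the table, such as (0, 4), count teachers ranked in
the bottom fifth on one test and the top fifth on the other.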

Can a teacher's unique effect be isolated? 

The successful use of value-added requires a 
high level of confidence in the attribution of 
achievement gains to specific teachers. One 
must be confident that other explanations for 
test score gains have been accounted for before 
rewarding or punishing teachers based on these 
measures. In practice, countless factors
hinder our ability to isolate a teacher’s
unique effect on achievement.

Given one year of test score gains, it is impossi- 
ble to distinguish between the teacher’s effect 
and other classroom-level factors. Over many 
years, unusual swings average out, making it 
easier to infer teachers’ own effects, but this is 
of little comfort to a teacher or school leader 
looking for actionable information today. 

What is more, teachers with the fewest years 
of data - novice teachers - arguably have the 
most to gain from feedback on their perform- 
ance. Yet the value-added scores for these teach- 
ers are the least reliable. 



Most value-added systems in practice - includ- 
ing New York City’s - fail to separate teachers’ 
influences from school-level effects on achieve- 
ment. But performance differs systematically 
across schools due to differences in school pol- 
icy, leadership, discipline, staff quality, and stu- 
dent mix. Recent research suggests that school 
factors can and do affect teachers’ value-added. 
Jackson and Bruegmann (2009) found that 
students perform better when their teachers 
have had more effective colleagues. Other stud- 
ies have found effects of principal leadership on 
student outcomes (Clark, Martorell & Rockoff 
2009). Consequently, teachers rewarded or 
punished for value-added may be rewarded or 
punished, in part, based on the colleagues with 
whom they work. 

Who counts? 

Value-added systems, in practice, ignore a large 
fraction of the educational enterprise. Only a 
minority of teachers teach subjects amenable to 
standardized testing; not all students are tested; 
and not all tested students contribute to value- 
added scores. From the standpoint of value- 
added assessment, these students and teachers 
do not count. 

In most states, students are tested in reading 
and math in grades three to eight and again in 
high school. Other subjects, including science 
and social studies, are tested less frequently. 
Because value-added requires last year’s test 
score, only teachers of reading and math in 
grades four to eight are typically assessed using 
value-added. Thus, elementary, middle school, 
and high school teachers of all subjects other 
than reading and math are ignored by value- 
added assessment. 
Some students are routinely exempted from
testing or, for one reason or another, are miss- 
ing a test score. Large urban districts often 
have a large number of these cases. I examined 
data from Houston to see how missing data 
can affect “who counts” toward a teacher’s 
value-added assessment. I looked at the per- 
centage of students in grades four to six over 
eight years of testing who were tested in two 
consecutive years and thus can contribute to a 
value-added score. Because of disabilities, lim- 
ited English ability, absenteeism, and other rea- 
sons, roughly 14 percent of students in Hous- 
ton lack a test score in any given year. As many 
as 16 percent of Black students lack scores, and 
close to 30 percent of recent immigrants are 
not tested. 

The percentage of students who have both a 
current and prior year test score is even lower. 
Only 66 percent of all students had both 
scores, a fraction that falls to 62 percent for 
Black students, 47 percent for ELL students, 
and 41 percent for recent immigrants. Thus, in
a given year, depending on the group, 40 per- 
cent to 60 percent of students in this popula- 
tion do not count toward teachers’ value-added 
assessments. 
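
The two coverage rates discussed above (tested in the current year, and
tested in both the current and prior year) can be computed as in the
following Python sketch. The data layout, function name, and five-student
class are hypothetical and serve only to show the calculation.

```python
def coverage(roster, scores, year):
    """Return (share tested in `year`, share with scores in `year` and `year - 1`).

    roster: ids of students enrolled in `year`
    scores: dict mapping (student_id, year) -> test score
    """
    tested = [s for s in roster if (s, year) in scores]
    both = [s for s in tested if (s, year - 1) in scores]
    return len(tested) / len(roster), len(both) / len(roster)

# Hypothetical class of five students.
roster = ["s1", "s2", "s3", "s4", "s5"]
scores = {
    ("s1", 2008): 610, ("s1", 2009): 640,
    ("s2", 2008): 580, ("s2", 2009): 575,
    ("s3", 2009): 600,   # new arrival: no prior-year score
    ("s4", 2008): 630,   # absent or exempted in 2009
    ("s5", 2008): 590, ("s5", 2009): 615,
}
tested_share, both_share = coverage(roster, scores, 2009)
```

Only students counted in the second share can contribute to a teacher’s
value-added estimate for that year.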

This issue is more than just a technical nui- 
sance. To the extent that districts reward or 
punish teachers on the basis of value-added, 
they risk ignoring teachers’ efforts with a sub- 
stantial share of their students and provide no 
incentive for teachers to invest in students who 
will not count. Unfortunately, districts like 
New York City and Houston have very large 
numbers of mobile, routinely exempted, and 
frequently absent students, and these students 
are unevenly distributed across schools and 
classrooms. Teachers serving these students in
disproportionate numbers are most likely to be
affected by a value-added system that - by 
necessity - ignores many of their students. 

Are value-added scores precise enough 
to be useful? 

Some uncertainty is inevitable in value-added 
measurement, but for practical purposes it is 
worth asking: Are value-added measures precise 
enough to be useful in high-stakes decision- 
making or for professional development? Using 
the example given earlier, a teacher ranked in 
the 43rd percentile on New York City’s Teacher 
Data Report might have a range of possible 
scores from the 15th to the 71st percentile 
after taking statistical uncertainty into account. 
What is the source of this imprecision? Recall 
that value-added measures are estimates of a 
teacher’s contribution to student test-score 
gains. The more certain we can be that gains 
are attributable to a specific teacher, the more 
precise our estimates will be. The best way to 
improve this certainty is to have more years of 
classroom test results. Thus, experienced teach- 
ers will tend to have more precise estimates 
than new teachers. 
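
The gain in precision from additional years can be illustrated with a
short Python sketch. It rests on a simplifying assumption that is not
drawn from either district’s model: each year’s estimate carries
independent noise of the same size, and the multi-year estimate is a
simple average, so the margin of error shrinks with the square root of
the number of years. The function name and the value of yearly_sd are
hypothetical.

```python
import math

def margin_of_error(yearly_sd, n_years, z=1.96):
    """Half-width of the interval around an average of n_years estimates,
    assuming independent yearly noise with standard deviation yearly_sd."""
    return z * yearly_sd / math.sqrt(n_years)

# Hypothetical noise level, in the same units as the value-added score.
yearly_sd = 0.25
margins = {years: margin_of_error(yearly_sd, years) for years in (1, 2, 3, 4)}
```

Under this assumption, four years of data are needed to cut the margin of
error from a single year in half.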

To get a better sense of the average level of 
uncertainty in New York City’s Teacher Data 
Reports, I examined the full set of value-added 
estimates reported by that system in 2008- 
2009. As expected, the level of uncertainty is 
higher when only one year of test results is 
used versus three. But in both cases, the aver- 
age range of percentiles is very wide. For exam- 
ple, in math (and using all years of available 
data, which provides the most precise possible 
measures), the average range is about 34 per-
centage points (e.g., from the 46th to the 80th
percentile). When looking at only one year of
test results, the average range increases to 61
percentage points (e.g., from the 30th to the 
91st percentile). 

The average level of uncertainty is higher still 
in English language arts and in sections of the 
city with high levels of student mobility, such 
as the Bronx. Given the level of uncertainty 
reported in the data reports, half of all teachers 
in grades four to eight have wide enough per- 
formance ranges that they cannot be statisti- 
cally distinguished from 60 percent or more of 
all other teachers in the city. 
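
A calculation of this kind can be sketched in Python as follows. The
sketch assumes that two teachers “cannot be statistically distinguished”
when their percentile ranges overlap, which is an illustrative criterion
rather than a documented one; the function name and the five ranges are
hypothetical.

```python
def share_indistinguishable(ranges):
    """ranges: list of (low, high) percentile ranges, one per teacher.
    Returns, for each teacher, the fraction of other teachers whose
    range overlaps that teacher's range."""
    shares = []
    for i, (lo_i, hi_i) in enumerate(ranges):
        overlaps = sum(
            1
            for j, (lo_j, hi_j) in enumerate(ranges)
            if j != i and lo_i <= hi_j and lo_j <= hi_i
        )
        shares.append(overlaps / (len(ranges) - 1))
    return shares

# Hypothetical percentile ranges for five teachers.
ranges = [(15, 71), (46, 80), (30, 91), (85, 99), (1, 10)]
shares = share_indistinguishable(ranges)
```

Wide ranges overlap with many others, so a teacher with a wide range is
indistinguishable from a large share of her peers.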

Using New York City’s performance categories, 
we cannot rule out the possibility that a 
teacher with a range of percentiles from 15 to
71 is “below average,” “average,” or close to 
“above average.” It is unclear what this teacher 
or his principal can do with this information 
to improve instruction or raise student per- 
formance. More years of data help, but the 
promise that better data will be available in the 
future is of little use to a teacher looking for 
guidance in real time. Value-added results for 
student subgroups might hold greater promise, 
to the extent that they highlight areas in need 
of improvement. Yet in most cases, the number 
of students used to calculate these subgroup 
scores is so small that the resulting level of 
uncertainty renders them meaningless. 

It is interesting to point out that, by definition, 
50 percent of teachers will perennially fall in 
the “average” performance category on New
York City’s Teacher Data Report. Another 40
percent will be considered “below average” or 
“above average.” The remaining 10 percent are 
either exceptional (top 5 percent) or failing 
(bottom 5 percent). Thus, out of all teachers 
issued a value-added report each year, half will 
be told little more than that they are “average.” 
At most, one in three will receive a signal that 
improvement is needed, though high levels of 
uncertainty will raise some doubt about this 
signal. In no case will teachers be told what 
actions need to be taken. Of course, teachers 
persistently in the top 5 percent are almost cer- 
tainly worth recognizing; teachers persistently 
in the bottom 5 percent deserve immediate 
scrutiny. Still, it seems a great deal of effort has 
been expended to identify a very small fraction 
of teachers. In the end, a tool designed for dif- 
ferentiating teacher effectiveness has done very 
little of the sort. 
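
The mapping from percentiles to performance categories can be sketched in
Python. The cutoffs for “low,” “average,” and “high” follow the shares
given above; the even split of the remaining 40 percent between “below
average” and “above average” is an assumption, and the function name is
hypothetical.

```python
from collections import Counter

def performance_category(percentile):
    """Assign a category from a percentile in the range 0-99.
    Assumed cutoffs: bottom 5, next 20, middle 50, next 20, top 5."""
    if percentile < 5:
        return "low"
    if percentile < 25:
        return "below average"
    if percentile < 75:
        return "average"
    if percentile < 95:
        return "above average"
    return "high"

# Share of teachers in each category if percentiles 0-99 are equally common.
shares = Counter(performance_category(p) for p in range(100))
```

Because the categories are defined by fixed slices of the distribution,
these shares are the same every year regardless of how well teachers in
the district actually perform.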

Is value-added stable from year to year? 

Given the extent of uncertainty in teacher 
value-added scores, it would not be surprising 
if these estimates fluctuated a great deal from 
year to year. In fact, this is generally what is 
observed in both Houston and New York City. 
In Houston, among those in the lowest 20 per- 
cent of value-added, only 36 percent remain 
among the lowest performers in the following 
year. Similarly, among those in the top 20 per- 
cent, only 38 percent remain among the top 
performers the next year. Twenty-three percent 
of last year’s lowest performers are among the 
top performers in the following year, and vice 
versa. A similar pattern holds in an analysis of 
New York City Teacher Data Report data. 
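
To see how measurement noise alone can produce this kind of instability,
consider the following Python simulation. It is a sketch under stated
assumptions, not a reanalysis of the Houston or New York City data: each
teacher has a fixed true effect, and each year’s estimate adds
independent, normally distributed noise. All names and parameter values
are hypothetical.

```python
import random

def bottom_fifth(scores):
    """Teachers whose score falls in the lowest 20 percent."""
    cutoff = sorted(scores.values())[len(scores) // 5]
    return {t for t, s in scores.items() if s < cutoff}

def persistence(year1, year2):
    """Share of year-1 bottom-fifth teachers still in the bottom fifth in year 2."""
    low1, low2 = bottom_fifth(year1), bottom_fifth(year2)
    return len(low1 & low2) / len(low1)

def simulate(n_teachers=1000, noise_sd=1.0, seed=0):
    """Persistence when true effects are stable but estimates are noisy."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = {t: e + rng.gauss(0, noise_sd) for t, e in enumerate(true_effect)}
    year2 = {t: e + rng.gauss(0, noise_sd) for t, e in enumerate(true_effect)}
    return persistence(year1, year2)

noiseless = simulate(noise_sd=0.0)  # estimates equal true effects
noisy = simulate(noise_sd=1.0)      # noise as large as the true variation
```

With no noise, every bottom-fifth teacher stays in the bottom fifth. As
noise_sd grows, persistence falls toward the 20 percent expected by
chance, even though no teacher’s true effectiveness has changed.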

Again, imprecision and variability are reduced as
additional years of classroom data accumulate. 
But here again, this knowledge is of little use
in real time. A top-performing teacher may be
rewarded (or punished) one year based on her
latest round of test results, only to get the 
opposite feedback the following year. Wisely, 
districts that have adopted value-added systems 
- including New York City - caution users 
against making rash decisions based on one 
year of estimates. But this estimate is one of
only a few made available in value-added 
assessment systems. Inexperienced teachers - 
those most in need of immediate feedback - 
simply will not have the multiple years of 
data on which to rely. It seems unlikely that 
teachers and their school leaders will not pay 
close attention to these noisy and imprecise 
estimates. 

Discussion 

In the abstract, value-added assessment of 
teacher effectiveness has great potential to 
improve instruction and, ultimately, student 
achievement. The notion that a statistical 
model might be able to isolate each teacher’s 
unique contribution to his or her students’ 
educational outcomes - and by extension, 
their life chances - is a powerful one. With 
such information in hand, one could not 
only devise systems that reward teachers with 
demonstrated records of success in the class- 
room - and remove teachers who do not - but 
also create a school climate in which teachers 
and principals work constructively with their 
test results to make positive instructional and 
organizational changes. 



But the promise that value-added systems can 
provide such a precise, meaningful, and com- 
prehensive picture is much overblown. As this 
report argues, value-added assessments - like 
those reported in the New York City Teacher 
Data Reports and used to pay out bonuses in 
Houston’s ASPIRE program - are, at best, a 
crude indicator of the contribution that teach- 
ers make to their students’ academic outcomes. 
Moreover, the set of skills that can be ade- 
quately assessed in a manner appropriate for 
decisions based on value-added represents a 
small fraction of the goals our nation has set 
for our students and schools. 

The implementation of value-added systems 
faces many challenges. Not all students are 
tested, and many, if not a majority, of teachers
do not teach tested subjects. Students without 
a prior-year test score - such as chronically 
mobile students, exempted students, and those 
absent on the day of the test - simply do not 
count toward teachers’ value-added estimates. 
In many districts, these students constitute a 
substantial share of many teachers’ classrooms. 

Often, state tests are predictable in both con- 
tent and format, and value-added rankings will 
tend to reward those who take the time to 
master the predictability of the test. Evidence 
from Houston presented here showed that 
one’s perception of a teacher’s value-added can 
depend heavily on which test one looks at. 
Annual value-added estimates are highly vari- 
able from year to year and, in practice, many 
teachers cannot be statistically distinguished 
from the majority of their peers. Persistently 
exceptional or failing teachers - say, those in 
the top or bottom 5 percent - may be success- 
fully identified through value-added scores, but 
it seems unlikely that school leaders would not 
already be aware of these teachers’ persistent
successes or failures. 

Research on value-added remains in its infancy, 
and it is likely that these methods - and the 
tests on which they are based - will continue 
to improve over time. The simple fact that 
teachers and principals are receiving regular 
and timely feedback on their students’ achieve- 
ment is an accomplishment in and of itself, 
and it is hard to argue that stimulating conver- 
sation around improving student achievement 
is not a positive thing. But teachers, policy- 
makers, and school leaders should not be 
seduced by the elegant simplicity of “value- 
added.” Before adopting these measures whole- 
sale, policy-makers should be fully aware of 
their limitations and consider whether the 
minimal benefits of their adoption outweigh 
the cost. 



References 

Clark, Damon, Paco Martorell, and Jonah 
Rockoff. 2009. “School Principals and 
School Performance.” CALDER Working 
Paper No. 38. Washington, DC: Urban 
Institute. 

Holcombe, Rebecca, Jennifer L. Jennings, and 
Daniel Koretz. 2010. “Predictable Patterns 
that Facilitate Score Inflation: A Comparison 
of New York and Massachusetts.” Working 
Paper. Cambridge, MA: Harvard University. 

Jackson, C. Kirabo, and Elias Bruegmann.
2009. “Teaching Students and Teaching
Each Other: The Importance of Peer Learn-
ing for Teachers,” American Economic Jour-
nal: Applied Economics 1:85-108.

Jennings, Jennifer L., and Jonathan M. Bearak.
2010. “Do Educators Teach to the Test?”
Paper presented at the Annual Meeting of 
the American Sociological Association, 
Atlanta. 

Kane, Thomas J., Jonah E. Rockoff, and Dou-
glas O. Staiger. 2008. “What Does Certifica-
tion Tell Us about Teacher Effectiveness? Evi-
dence from New York City,” Economics of
Education Review 27:615-631.

Papay, John P. 2010. “Different Tests, Different 
Answers: The Stability of Teacher Value- 
Added Estimates Across Outcome Meas-
ures,” American Educational Research Journal,
published online (April 19); print version 
forthcoming. 






Annenberg Institute for School Reform
at Brown University



Providence 

Brown University 
Box 1985 

Providence, RI 02912 
T 401.863.7990 
F 401.863.1290 

New York 

233 Broadway, Suite 720 
New York, NY 10279 
T 212.328.9290 
F 212.964.1057 

www.annenberginstitute.org 




